Instructions

The final exam consists of 10 programming exercises, each representing one module from the course. In order to receive a base-level pass, you must successfully complete 5 of the 10 exercises; in order to receive an advanced-level pass, you must successfully complete 8 of the 10 exercises.

Your final file(s) should be submitted via the dropbox on the course page (under the Capstone Project section of the site). If you have any questions during the exam, you should not hesitate to ask the instructor.

By taking this exam, you assert that you neither gave nor received unauthorized aid during the course of the exam.

Background

The ability to easily store and access data is changing the way that cities conduct government business. Local governments collect data on the services they oversee, including traffic stops, utility usage, etc. Many cities across the world have decided to make their data open to the public. Bloomington, Indiana is one such city (https://data.bloomington.in.gov/).

When citizens have non-emergency requests for city services (trash can was not picked up, street light out), they contact city government. This is typically done by dialing “311.” Open311 is a protocol for collecting and reporting data falling into this category. All service calls made in 2015 have been made available by Bloomington. We will be working with this data throughout. The datasets needed to complete the final exam are available on the course page (FinalData.RData). Assuming the RData file has been placed in the same directory as your R code, you can load these datasets using the following command:

load("FinalData.RData")

This loads the following elements:

You will also need the following file, which is on the course site.

Problem 1 (Data Structures)

Consider the following two questions:

To address these questions, construct the following statistics for each day of the week:

Specifically, construct code which results in the following summary:

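| Weekday Requested | Number of Calls | Percent Parks and Rec |
|-------------------|-----------------|-----------------------|
| Sunday            | 186             | 2.1505376             |
| Monday            | 1792            | 0.2232143             |
| Tuesday           | 1750            | 0.4000000             |
| Wednesday         | 1237            | 0.2425222             |
| Thursday          | 1026            | 0.1949318             |
| Friday            | 419             | 0.7159905             |
| Saturday          | 135             | 0.0000000             |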
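As a rough illustration of the kind of per-weekday summary described above, the following sketch works on a small made-up data frame. The data frame `calls`, its columns `weekday` and `department`, and the department labels are hypothetical placeholders; they are not the actual names in open311.df.

```r
# Hypothetical toy data; the real column names in open311.df may differ.
days <- c("Sunday", "Monday", "Tuesday", "Wednesday",
          "Thursday", "Friday", "Saturday")
calls <- data.frame(
  weekday = factor(c("Monday", "Monday", "Tuesday", "Sunday", "Monday", "Tuesday"),
                   levels = days),
  department = c("Parks and Rec", "Street", "Sanitation",
                 "Parks and Rec", "Street", "Street")
)

# Logical indicator: TRUE when the call went to Parks and Rec.
is_parks <- calls$department == "Parks and Rec"

# Count of calls and percent of calls for Parks and Rec, by weekday.
# Days with no calls in the toy data get NA for the percentage.
weekday_summary <- data.frame(
  weekday   = days,
  n_calls   = as.vector(table(calls$weekday)),
  pct_parks = as.vector(100 * tapply(is_parks, calls$weekday, mean))
)
weekday_summary
```

Storing the weekday as a factor with explicit levels keeps the rows in calendar order rather than alphabetical order.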

Problem 2 (Functions)

One of the variables in open311.df is Distance, the distance of the request location from the city center. This was computed using the “haversine” formula to calculate the great-circle distance between the two locations, that is, the shortest distance over the earth’s surface “as the crow flies.” The haversine distance, \(d\), is given by the following formula:

\[ \begin{aligned} x &= \sin^2\left(\frac{\phi_2 - \phi_1}{2}\right) + \left[\cos\left(\phi_1\right) \cdot \cos\left(\phi_2\right) \cdot \sin^2\left(\frac{\lambda_2 - \lambda_1}{2}\right)\right] \\ y &= 2 \cdot \text{atan}_2\left(\sqrt{x}, \sqrt{(1-x)}\right) \\ d &= R \cdot y \end{aligned} \]

where

Write a function haversine() which returns the haversine distance and takes the following parameters:

Note, the function should be vectorized over the parameters lat1, long1, lat2, long2; that is, these should accept vectors as arguments.

The latitude and longitude (in degrees) for the city center of Bloomington, Indiana are (Latitude: 39.1653, Longitude: -86.5264). You can test your function by recreating the Distance column of open311.df.
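One possible shape for such a function is sketched below. It assumes that \(\phi\) denotes latitude and \(\lambda\) denotes longitude, that the inputs are given in degrees and converted to radians inside the function, and that the earth's radius is passed as a parameter `R` with a default of 6371 km. The parameter name `R`, its default, and the resulting units are assumptions, so the output may not match the units used for the Distance column.

```r
# Sketch of a vectorized haversine distance.
# Assumptions: inputs are in degrees; R is the earth's radius (default 6371 km).
haversine <- function(lat1, long1, lat2, long2, R = 6371) {
  to_rad <- pi / 180

  phi1    <- lat1 * to_rad              # latitude of point 1 (radians)
  phi2    <- lat2 * to_rad              # latitude of point 2 (radians)
  dphi    <- (lat2 - lat1) * to_rad     # difference in latitude
  dlambda <- (long2 - long1) * to_rad   # difference in longitude

  x <- sin(dphi / 2)^2 + cos(phi1) * cos(phi2) * sin(dlambda / 2)^2
  y <- 2 * atan2(sqrt(x), sqrt(1 - x))
  R * y
}

# Usage: two made-up request locations against the city center.
haversine(c(39.20, 39.10), c(-86.50, -86.60), 39.1653, -86.5264)
```

Because every operation inside the function is element-wise, vector arguments work without an explicit loop, and length-one arguments such as the city center are recycled.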

Problem 3 (Graphics)

We would like to graphically examine the relationship between the time required to complete a service request and the distance the request is from the city center. In addition, we would like to determine if this relationship tends to differ depending on the day the service was requested. This could be captured in the following graphic:

You are to recreate the above graphic. Note the following elements should be included:

Problem 4 (Strings)

When a call is taken, the operator generally provides a description of the request being made; in addition, the type of service requested is determined (there are 45 unique names given to service calls). For all calls received (regardless of whether the status is open or closed), determine the following:

Problem 5 (Data Input/Output)

We have been working with a fairly clean dataset. However, this is not how the data is available online. The file open311.xml is the original file downloaded from the website (see above for link). You are to use this file to construct a version of the open311.df dataset. In particular, construct a dataset MyOpen311.df (from the open311.xml file) which contains the following variables:

Your dataset, if constructed correctly, will match the records of open311.df.

Problem 6 (Dynamic Graphics)

Consider the following logical questions you might ask with this dataset:

To get at this, we consider an interactive graphic which allows us to examine where in Bloomington service calls are originating. You are to replicate the following graphic.

Note the following attributes of the graphic:

Problem 7 (Simulation)

The “Street Department” received 140 requests regarding “Street Lights” over the course of the year. Suppose that these calls are representative of what the Street Department in Bloomington receives regarding street lights. In particular, assume that they can be used to represent the time between such calls for this department.

Further, suppose we are willing to make the following assumption: let \(X\) represent the cost of making a repair to a street light; assume \(X \sim N\left(1900, 300^2\right)\). That is, the cost varies according to a Normal distribution with a mean of $1900 and a standard deviation of $300.

Suppose the city has set aside $300,000 for repairs to street lights for the year. Will this be enough funds? Conduct a simulation to determine the following:

Your annual costs should have the following distribution (based on 10,000 simulations).
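A simulation of this kind might be structured as in the sketch below. It is only one reading of the setup: gaps between calls are resampled with replacement until a 365-day year is filled, and each call is assumed to trigger one repair with a Normal(1900, 300) cost. The vector `gaps` is randomly generated placeholder data standing in for the observed times between calls, and the function name `simulate_year` is hypothetical.

```r
set.seed(1)

# Placeholder gaps (in days) between consecutive calls. In practice these
# would be computed from the observed request times; these values are made up.
gaps <- rexp(139, rate = 140 / 365)

# Simulate the total repair cost for one year.
simulate_year <- function(gaps, horizon = 365, mean_cost = 1900, sd_cost = 300) {
  elapsed <- 0
  n_calls <- 0
  repeat {
    # Draw the waiting time until the next call from the empirical gaps.
    elapsed <- elapsed + sample(gaps, 1)
    if (elapsed > horizon) break
    n_calls <- n_calls + 1
  }
  # One Normal repair cost per call received within the year.
  sum(rnorm(n_calls, mean = mean_cost, sd = sd_cost))
}

annual_cost <- replicate(1000, simulate_year(gaps))

# Estimated probability that the annual cost exceeds the $300,000 budget.
mean(annual_cost > 300000)
```

Resampling the gaps directly avoids assuming a parametric model for the time between calls; only the cost distribution is taken as given.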

Problem 8 (Randomization-Based Inference)

The director of the Sanitation Department would like to reference this data when making requests for funding. In particular, he would like to make a statement like “the Sanitation Department is \(\theta\) times more likely to receive a request for service than the Street Department.” He could estimate this value by considering the following:

\[\widehat{\theta} = \frac{\text{Number of Requests for Sanitation Department}}{\text{Number of Requests for Street Department}}\]

However, having had a course in statistics, he would like to present a 95% confidence interval for this estimate. Write a bootstrap algorithm to compute a 95% confidence interval for this estimate. You may assume that the calls for the year provided are a random sample of typical call volume/type.

Your interval should be roughly (based on 5000 replications):

    2.5%    97.5% 
5.011411 5.869312 
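A percentile bootstrap for a ratio of counts could be sketched as follows. The vector `dept` is randomly generated placeholder data (one department label per call), and the labels "Sanitation", "Street", and "Other" are hypothetical; the real department names and counts come from open311.df, so the interval produced here will not match the one above.

```r
set.seed(42)

# Placeholder department labels, one per call (made-up proportions).
dept <- sample(c("Sanitation", "Street", "Other"), size = 2000,
               replace = TRUE, prob = c(0.5, 0.1, 0.4))

# Ratio of Sanitation requests to Street requests.
ratio <- function(d) sum(d == "Sanitation") / sum(d == "Street")

theta_hat <- ratio(dept)

# Resample calls with replacement and recompute the ratio each time.
boot <- replicate(2000, ratio(sample(dept, replace = TRUE)))

# Percentile bootstrap 95% confidence interval.
ci <- quantile(boot, c(0.025, 0.975))
ci
```

Resampling whole calls, rather than the two counts separately, keeps the dependence between the numerator and denominator that comes from a fixed total number of calls.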

Problem 9 (Machine Learning)

We now use the few features in the open311.df dataset to build a machine learning algorithm to predict the length of time (in minutes) to close a call. Write code which does the following:

Note: due to the complexity of the response, it can take a while for the model to finish running. The performance metrics for your model on your test set should be similar to the following (the model is not very good):

     RMSE  Rsquared       MAE 
18726.059     0.206  7298.388 
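For reference, the three metrics shown above can be computed from a vector of predictions and a vector of observed values along the following lines. The function name `eval_metrics` is hypothetical, and R-squared is taken here to be the squared correlation between predictions and observations, which is an assumption about how the reported value was obtained.

```r
# Compute RMSE, R-squared (squared correlation), and MAE for test-set predictions.
eval_metrics <- function(pred, obs) {
  c(
    RMSE     = sqrt(mean((pred - obs)^2)),
    Rsquared = cor(pred, obs)^2,
    MAE      = mean(abs(pred - obs))
  )
}

# Usage with made-up numbers.
metrics <- eval_metrics(pred = c(1, 2, 4), obs = c(1, 3, 3))
metrics
```

RMSE penalizes large errors more heavily than MAE, so a wide gap between the two, as in the output above, suggests that a few calls with very long closing times dominate the error.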

Problem 10 (Reproducible Research)

Write your solutions to the exam using R Markdown and submit your code and any relevant output as an HTML page. A couple of things regarding your solutions: